An exploration of train delay data

Published: 2025-05-13

Purpose of this Quarto HTML document

  • To explore the dataset from Zhang et al.
  • To determine whether we can find a link between weather and train delays.

Let's begin by looking at some basic features of the dataset:

Dimensions
[1] 2751713      16
Column names
 [1] "date"                     "train_number"            
 [3] "train_direction"          "station_name"            
 [5] "station_order"            "scheduled_arrival_time"  
 [7] "scheduled_departure_time" "stop_time"               
 [9] "actual_arrival_time"      "actual_departure_time"   
[11] "arrival_delay"            "departure_delay"         
[13] "wind"                     "weather"                 
[15] "temperature"              "major_holiday"           

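The dataset is well structured but large: about 2.75 million rows and 16 columns. Alongside the timetable fields it records arrival and departure delays together with wind, weather and temperature, which are the columns we need for our question.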



Overview of the data - Summary statistics

To get a quick glimpse of the data, we can look at some summary statistics.

station_name                  Mean_arriv  Mean_depar  stdev_arriv  stdev_delay    n  unique_arriv  unique_dep  Mean_temp
Jianwei Railway Station         532.0000    532.0000      0.00000      0.00000   29             1           1  10.793103
Yuzhou Railway Station          531.4444    531.4444     75.67617     75.67617   36             3           3   5.805556
Guanyun Railway Station         500.0132    500.0132    178.21474    178.21474  151             9           9   6.622517
Fangcheng Railway Station       485.4444    485.4444     75.67617     75.67617   36             3           3   6.055556
Jieshounan Railway Station      465.2176    465.2176    359.13111    359.13111  239            18          18   5.941423
Xingandong Railway Station      416.3605    416.3605    212.42152    212.42152  147            10          10  11.476190

train_number  Mean_arriv  Mean_depar  stdev_arriv  stdev_delay    n  unique_arriv  unique_dep  Mean_temp
G4027           853.1429    696.7143     382.2505     542.9797    7             7           7   28.85714
G4919           840.0000    422.6667     653.4977     701.7814    6             3           3   23.66667
G4950           826.5000    642.6667     410.2257     578.8743    6             6           6   22.66667
G9252           811.0000    722.3077     253.9600     418.3552   13            13          13   19.92308
G4923           801.0000    531.0000     534.6631     640.0818    4             4           4   24.75000
G4966           661.2500    447.7500     411.7026     502.2102    8             4           4   22.00000
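
The columns in these tables (mean, standard deviation, row count and number of unique values per group) can be reproduced with a simple group-and-aggregate step. The following is a minimal sketch in plain Python, not the code used to build the tables; the sample rows and the `summarise` helper are hypothetical. It also shows how a group with a single repeated delay value ends up with a standard deviation of 0 and a unique count of 1, as in the Jianwei Railway Station row.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical sample rows: (station_name, arrival_delay, departure_delay).
# The values are made up for illustration and are not taken from the dataset.
rows = [
    ("Station A", 5, 6),
    ("Station A", 5, 6),
    ("Station A", 5, 6),
    ("Station B", 0, 2),
    ("Station B", 10, 12),
    ("Station B", 20, 16),
]


def summarise(rows):
    """Group rows by station and compute per-group summary statistics."""
    groups = defaultdict(list)
    for station, arrival, departure in rows:
        groups[station].append((arrival, departure))

    summary = {}
    for station, delays in groups.items():
        arrivals = [a for a, _ in delays]
        departures = [d for _, d in delays]
        summary[station] = {
            "mean_arriv": mean(arrivals),
            "mean_depar": mean(departures),
            # The sample standard deviation needs at least two observations.
            "stdev_arriv": stdev(arrivals) if len(arrivals) > 1 else 0.0,
            "stdev_depar": stdev(departures) if len(departures) > 1 else 0.0,
            "n": len(delays),
            "unique_arriv": len(set(arrivals)),
            "unique_dep": len(set(departures)),
        }
    return summary


summary = summarise(rows)
for station, stats in summary.items():
    print(station, stats)
```

A unique count that is much smaller than `n` is worth a closer look, since it suggests the same delay value is recorded many times.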



Summary stats can also be plotted






Looking at the data (and not the summary statistics)

Summary statistics can be informative and help us understand data, but they can also obscure problems in a dataset.

Plotting individual datapoints can help when exploring a new set of data. So let’s look at the departures from a few stations:
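
To illustrate why individual datapoints matter, here is a small sketch with made-up numbers (not values from the dataset). The two hypothetical lists of departure delays have the same mean, yet one is steady and the other consists of zeros plus a single extreme value.

```python
from statistics import mean, median

# Hypothetical departure delays, made up for illustration only.
steady = [10, 10, 10, 10, 10]
one_outlier = [0, 0, 0, 0, 50]

for name, delays in [("steady", steady), ("one_outlier", one_outlier)]:
    # The mean is identical for both lists; the median and maximum are not.
    print(name, "mean:", mean(delays), "median:", median(delays), "max:", max(delays))
```

A table that reports only the mean would treat the two lists as equivalent, whereas a plot of the raw points shows the difference immediately.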



Subsetting the data

To make a first, cursory analysis of our question manageable, we work with a subset of the data.
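
The report does not state how the subset was chosen. As one possible approach, the sketch below keeps only the rows for a few selected stations. The column names match the dataset, but the rows, the station names and the `subset_by_station` helper are hypothetical.

```python
# Hypothetical rows using column names from the dataset; values are made up.
rows = [
    {"station_name": "Station A", "departure_delay": 3, "temperature": 12},
    {"station_name": "Station B", "departure_delay": 0, "temperature": 8},
    {"station_name": "Station C", "departure_delay": 7, "temperature": -2},
    {"station_name": "Station A", "departure_delay": 1, "temperature": 15},
]


def subset_by_station(rows, stations):
    """Keep only the rows whose station_name is in the given collection."""
    wanted = set(stations)
    return [row for row in rows if row["station_name"] in wanted]


subset = subset_by_station(rows, ["Station A", "Station C"])
print(len(rows), "rows in total,", len(subset), "rows in the subset")
```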




What does the subset look like?

With a smaller dataset we can more easily plot out individual datapoints.

Interactivity can be a very helpful tool when trying to understand visuals and outputs. For larger datasets in particular, it can be a timesaver.



After outlier removal
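
The report does not say which rule was used to identify outliers. One common convention is to drop values more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below assumes that rule; the delays and the `remove_outliers_iqr` helper are hypothetical.

```python
from statistics import quantiles

# Hypothetical departure delays with one extreme value, made up for illustration.
delays = [1, 2, 3, 4, 5, 6, 7, 8, 500]


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


kept = remove_outliers_iqr(delays)
print("kept", len(kept), "of", len(delays), "values")
```

Any such rule is a judgment call: a very long delay may be a recording error, but it may also be a real event, so it helps to inspect the removed rows before discarding them.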




Basic analysis of relation between weather and departure delays

Having found a reasonable subset, we want to see whether it can help answer our question. We plot temperature against departure delay and fit a curve to the data.

There is a weak association: lower temperatures tend to coincide with longer departure delays. The effect is statistically significant but very small.
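
The type of curve fitted is not stated in the report. As a simple stand-in, the sketch below fits a straight line by ordinary least squares and reports the correlation coefficient. The temperature and delay values are made up for illustration, and `fit_line` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical (temperature, departure delay) observations, made up for illustration.
temperature = [-5, 0, 5, 10, 15, 20, 25]
delay = [4.0, 3.5, 3.6, 3.0, 3.1, 2.6, 2.5]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = fit_line(temperature, delay)
print(f"slope={slope:.4f} per degree, intercept={intercept:.4f}, r={r:.4f}")
```

The slope describes the size of the effect (change in delay per degree), while `r` describes how closely the points follow the line. With a large number of rows, even a very small slope can be statistically significant, which is why the two should be read together.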




References